Week # - Accessing and Processing GTFS

Bonwoo Koo

2022-07-14

Useful Links:

Environment Setting

# Import required packages
library(tidytransit)
library(tidyverse)
library(tmap)
library(ggplot2)
library(gtfsrouter)
library(here)
library(units)
library(sf)
library(leaflet)
library(tidycensus)
library(plotly)

setwd("D:/Dropbox (GaTech)/Work/Working/School/UA_2022/Lab/W3/")

Read GTFS Feed

Let’s download some GTFS feed provided by Metropolitan Atlanta Rapid Transit Authority (MARTA).

# From ULR
download_dest <- here::here(getwd(), "GTFS_MARTA.zip")
gtfs_local <- download.file("http://www.itsmarta.com/google_transit_feed/google_transit.zip", destfile = download_dest)


atl <- tidytransit::read_gtfs("http://www.itsmarta.com/google_transit_feed/google_transit.zip")
atl_temp <- gtfsrouter::extract_gtfs(download_dest)
## â–¶ Unzipping GTFS archive
✔ Unzipped GTFS archive
## Warning: This feed contains no transfers.txt 
##   A transfers.txt table may be constructed with the 'gtfs_transfer_table' function
## â–¶ Extracting GTFS feed
✔ Extracted GTFS feed 
## â–¶ Converting stop times to seconds
✔ Converted stop times to seconds
# Take a look
summary(atl)

GTFS feed contains many relational tables about transit service schedules, trips, stops, and routes.

Where to get the URL?

Here are some other sources of GTFS feeds:

Transit agencies (e.g., MARTA) may provide their feeds on their website as well. For example, I found the data we are using today by Googling ‘MARTA GTFS’ and got it from this website (look for Official Download URL in the website).

Understand what’s inside atl

typeof(atl)
## [1] "list"
names(atl)
## [1] "agency"         "calendar"       "calendar_dates" "routes"        
## [5] "shapes"         "stop_times"     "stops"          "trips"         
## [9] "."
head(atl$calendar)
## # A tibble: 6 × 10
##   service_id monday tuesday wednesday thursday friday saturday sunday start_date
##   <chr>       <int>   <int>     <int>    <int>  <int>    <int>  <int> <date>    
## 1 20              0       0         0        0      0        0      0 2022-04-23
## 2 28              0       0         0        0      0        0      0 2022-04-23
## 3 39              0       0         0        0      0        0      0 2022-04-23
## 4 24              0       0         0        0      0        0      0 2022-04-23
## 5 25              0       0         0        0      0        0      0 2022-04-23
## 6 26              0       0         0        0      0        0      0 2022-04-23
## # … with 1 more variable: end_date <date>

The atl object is a list. In it, names(atl) shows that there are 9 dataframes. These dataframes can be linked to each other using join keys. The diagram below shows how different tables are linked.

IMAGE SOURCE: http://tidytransit.r-transit.org/articles/introduction.html


The table below shows a brief description of what each dataframe contains. This table is taken from Google.

Filename Defines
agency Transit agencies with service represented in this dataset.
stops Stops where vehicles pick up or drop off riders. Also defines stations and station entrances.
routes Transit routes. A route is a group of trips that are displayed to riders as a single service.
trips Trips for each route. A trip is a sequence of two or more stops that occur during a specific time period.
stop_times Times that a vehicle arrives at and departs from stops for each trip.
calendar Service dates specified using a weekly schedule with start and end dates. This file is required unless all dates of service are defined in calendar_dates.txt.
calendar_dates Exceptions for the services defined in the calendar.txt. If calendar.txt is omitted, then calendar_dates.txt is required and must contain all dates of service.
fare_attributes Fare information for a transit agency’s routes.
fare_rules Rules to apply fares for itineraries.
shapes Rules for mapping vehicle travel paths, sometimes referred to as route alignments.
frequencies Headway (time between trips) for headway-based service or a compressed representation of fixed-schedule service.
transfer Rules for making connections at transfer points between routes.
pathways Pathways linking together locations within stations.
levels Levels within stations.
feed_into Dataset metadata, including publisher, version, and expiration information.
translations Translated information of a transit agency.
attributions Specifies the attributions that are applied to the dataset.

GTFS into geospatial format

The function gtfs_as_sf converts ‘shapes’ and ‘stops’ tables in GTFS data into sf objects.

atlsf <- tidytransit::gtfs_as_sf(atl, crs = 4326)
head(atlsf)
## $agency
## # A tibble: 1 × 7
##   agency_id agency_name      agency_url agency_timezone agency_lang agency_phone
##   <chr>     <chr>            <chr>      <chr>           <chr>       <chr>       
## 1 MARTA     Metropolitan At… https://w… America/New_Yo… en          404-848-5000
## # … with 1 more variable: agency_fare_url <chr>
## 
## $calendar
## # A tibble: 16 × 10
##    service_id monday tuesday wednesday thursday friday saturday sunday
##    <chr>       <int>   <int>     <int>    <int>  <int>    <int>  <int>
##  1 20              0       0         0        0      0        0      0
##  2 28              0       0         0        0      0        0      0
##  3 39              0       0         0        0      0        0      0
##  4 24              0       0         0        0      0        0      0
##  5 25              0       0         0        0      0        0      0
##  6 26              0       0         0        0      0        0      0
##  7 27              0       0         0        0      0        0      0
##  8 29              0       0         0        0      0        0      0
##  9 31              0       0         0        0      0        0      0
## 10 2               0       0         0        0      0        0      0
## 11 3               0       0         0        0      0        1      0
## 12 4               0       0         0        0      0        0      1
## 13 5               1       1         1        1      1        0      0
## 14 21              0       0         0        0      0        0      0
## 15 33              0       0         0        0      0        0      0
## 16 34              0       0         0        0      0        0      0
## # … with 2 more variables: start_date <date>, end_date <date>
## 
## $calendar_dates
## # A tibble: 4 × 3
##   service_id date       exception_type
##   <chr>      <date>              <int>
## 1 34         2022-05-30              1
## 2 5          2022-05-30              2
## 3 29         2022-07-04              1
## 4 5          2022-07-04              2
## 
## $routes
## # A tibble: 118 × 9
##    route_id agency_id route_short_name route_long_name     route_desc route_type
##    <chr>    <chr>     <chr>            <chr>               <chr>           <int>
##  1 16883    MARTA     1                Marietta Blvd/Jose… ""                  3
##  2 16884    MARTA     2                Ponce de Leon Aven… ""                  3
##  3 16886    MARTA     3                Martin Luther King… ""                  3
##  4 16887    MARTA     4                Moreland Avenue     ""                  3
##  5 16889    MARTA     5                Piedmont Road / Sa… ""                  3
##  6 16888    MARTA     6                Clifton Road / Emo… ""                  3
##  7 16890    MARTA     8                North Druid Hills … ""                  3
##  8 16891    MARTA     9                Boulevard / Tilson… ""                  3
##  9 16892    MARTA     12               Howell Mill Road /… ""                  3
## 10 16893    MARTA     14               14th Street / Blan… ""                  3
## # … with 108 more rows, and 3 more variables: route_url <chr>,
## #   route_color <chr>, route_text_color <chr>
## 
## $shapes
## Simple feature collection with 367 features and 1 field
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -84.67085 ymin: 33.4323 xmax: -84.08274 ymax: 34.10663
## Geodetic CRS:  WGS 84
## First 10 features:
##    shape_id                       geometry
## 1    100095 LINESTRING (-84.45052 33.81...
## 2    100096 LINESTRING (-84.41341 33.73...
## 3    100097 LINESTRING (-84.38689 33.77...
## 4    100098 LINESTRING (-84.31255 33.76...
## 5    100101 LINESTRING (-84.46945 33.75...
## 6    100102 LINESTRING (-84.36823 33.75...
## 7    100103 LINESTRING (-84.35409 33.75...
## 8    100104 LINESTRING (-84.35409 33.75...
## 9    100105 LINESTRING (-84.35498 33.68...
## 10   100106 LINESTRING (-84.36614 33.69...
## 
## $stop_times
## # A tibble: 1,618,393 × 10
##    trip_id arrival_time departure_time stop_id stop_sequence stop_headsign
##    <chr>   <time>       <time>         <chr>           <int> <chr>        
##  1 7142673 06:43        06:43          27                  1 ""           
##  2 7142673 06:46        06:46          28                  2 ""           
##  3 7142673 06:49        06:49          485                 3 ""           
##  4 7142673 06:50        06:50          470                 4 ""           
##  5 7142673 06:51        06:51          796                 5 ""           
##  6 7142673 06:52        06:52          797                 6 ""           
##  7 7142673 06:53        06:53          201                 7 ""           
##  8 7142673 06:55        06:55          798                 8 ""           
##  9 7142673 06:58        06:58          799                 9 ""           
## 10 7142673 07:00        07:00          204                10 ""           
## # … with 1,618,383 more rows, and 4 more variables: pickup_type <int>,
## #   drop_off_type <int>, shape_dist_traveled <dbl>, timepoint <int>
# Interactive mapping
tmap::tmap_mode('view')
## tmap mode set to interactive viewing
m1 <- tmap::tm_shape(atlsf$shapes) + tmap::tm_lines(alpha = 0.5)
m2 <- tmap::tm_shape(atlsf$stops) + tmap::tm_dots(id = 'stop_name', alpha = 0.5)

tmap::tmap_arrange(m1, m2, sync = T)
# If using ggplot2 package
ggplot2::ggplot(data = atlsf$stops) +
  ggplot2::geom_sf()


Calculate travel time

The tidytransit package offers a very convenient function that calculates the shortest travel times from a stop to all other stops. Try ?tidytransit::travel_times to see what this function does and what arguments it takes.

Notice that only two out of nine arguments are required, and the rest are optional. The required arguments are ‘filtered_stop_times’ and ‘stop_name.’ If you do not provide the optional arguments, the function will use the pre-populated values for the calculation.

One of two arguments, filtered_stop_times, can be generated using the function tidytransit::filter_stop_times(). This function

## Filtered stop times

# net <- dodgr_streetnet_sc (pts = atl$stops [, c ("stop_lon", "stop_lat")])
# atl$transfers <- gtfs_transfer_table(atl)
# 
# x <- gtfs_traveltimes (atl_temp,
#                        from = "DUNWOODY STATION",
#                        start_time_limits = c (12, 13) * 3600)
# 
# fst <- filter_stop_times(gtfs_obj = atl, 
#                          extract_date = "2022-04-24", 
#                          min_departure_time = "06:00:00", 
#                          max_arrival_time = "10:00:00")
# 
# # travel_times
# trvt <- travel_times(filtered_stop_times = fst,
#                      stop_name = "MIDTOWN STATION",
#                      return_coords = TRUE)

Calculate area of service by SES

GTFS from MARTA contains information about both rails and buses. The ‘routes’ table in atl has a column named ‘route_type,’ which contains integer values denoting types of transit service. To see what each integer means, you will need to look at the link to the Reference provided above (or see below for a screen capture from the Reference).

Reference from Google

‘routes’ table contains route_types, which is needed for seperating rail transit. ‘shapes’ table contains the shapefile needed for geooperations such as buffer, intersections, etc. Because ‘routes’ and ‘shapes’ table do not contain a common key, they need to be joined through ‘trips’ as intermediary table.

# Filter out rail transit
route_trip <- atl$trips %>% dplyr::left_join(atl$routes, by = "route_id")

unique_shape <- route_trip %>% 
  dplyr::group_by(route_id) %>% 
  dplyr::slice(1) %>% 
  dplyr::ungroup()

route.shape <- atlsf$shape %>% inner_join(unique_shape, by = "shape_id")

# Route type is not really intuitive - let's fix that
route.shape <- route.shape %>% 
  dplyr::mutate(route_type = dplyr::case_when(
    route_type == "0" ~ 'Tram, Streetcar',
    route_type == "1" ~ 'Subway, Metro',
    route_type == "2" ~ 'Rail',
    route_type == "3" ~ 'Bus'
  ))
pal <- leaflet::colorFactor(c("red", "orange", "pink"), domain = route.shape$route_type)

route.shape %>% 
  leaflet::leaflet(data = .) %>% 
    leaflet::addProviderTiles(providers$CartoDB.DarkMatter) %>% 
    leaflet::addPolylines(color = ~pal(route_type), 
                 weight = 3,
                 opacity = 0.9,
                 popup = paste0("Route type: ", route.shape$route_type))

Until Version 1.0, the sf package used ‘equirectangular projection’ for many operations that involve, for example, intersects, intersection, union, nearest_point, st_join, etc. It did so even when the given data is in geographical coordinates.

From Version 1.0, sf package uses S2 spherical geometry for spatial operations, based on S2 geometry library written by Google. For more information on what all of these means, you can read this post.

# If we do not convert CRS to a projected one, the buffer may generate a ragged boundaries. 
# https://r-spatial.github.io/sf/articles/sf7.html#buffers-1
# https://r-spatial.org/r/2020/06/17/s2.html#sf-10-goodbye-flat-earth-welcome-s2-spherical-geometry
sf::sf_use_s2(TRUE)
MARTA_buffer <- route.shape %>% 
  sf::st_transform(crs = 26967) %>% 
  sf::st_buffer(dist = units::set_units(400, "m")) 

MARTA_buffer_group <- MARTA_buffer %>% 
  group_by(route_type) %>% 
  summarise()

Overlaying GTFS with Census

acs2020c <- acs2020 %>% 
  select(GEOID,
         hhinc = hhincE,
         r_tot = r_totE,
         r_wh = r_whE,
         r_bl = r_blE,
         tot_hh = tot_hhE,
         own_novhc = own_novhcE,
         rent_novhc = rent_novhcE) %>% 
  mutate(pct_wh = r_wh / r_tot,
         pct_bl = r_bl / r_tot,
         pct_novhc = (own_novhc + rent_novhc)/tot_hh) %>% 
  mutate(area1 = unclass(st_area(.))) %>% 
  st_transform(26967) %>% 
  mutate(area2 = unclass(st_area(.))) %>% 
  st_transform(crs = 4326) %>% 
  mutate(ln_pop_den = log((r_tot / (area1/1000^2)) + 1)) %>% 
  filter(!is.na(hhinc), !is.na(r_tot), !is.na(own_novhc))

pal1 <- colorNumeric(palette = "YlOrRd", domain = acs2020c$hhinc)
pal2 <- colorFactor("Spectral", domain = MARTA_buffer_group$route_type)

leaflet() %>% 
  addProviderTiles(providers$CartoDB.DarkMatter) %>% 
  addPolygons(group = "ACS",
              data = acs2020c, 
              color = "grey", 
              fillColor = ~pal1(hhinc), 
              fillOpacity = 0.5,
              weight = 1, 
              popup = leafpop::popupTable(round(st_drop_geometry(acs2020c[,c("hhinc", "pct_wh", "pct_bl")]),2))) %>% 
  addPolygons(group = "MARTA",
              data = MARTA_buffer_group %>% 
                st_transform(crs = st_crs(acs2020c)), 
              color = ~pal2(route_type),
              weight = 1,
              opacity = 0.9) %>% 
  addLayersControl(
    overlayGroups = c("ACS", "MARTA"),
    options = layersControlOptions(collapsed = FALSE)
  )
# Intersect buffer with tract 
buffer_per_tract1 <- acs2020c %>% 
  st_transform(crs = 26967) %>% 
  st_intersection(MARTA_buffer_group) %>% 
  filter(route_type == "Bus") %>% 
  # After Intersection
  mutate(subarea = unclass(st_area(.)),
         pct_served1 = subarea/area1,
         pct_served2 = subarea/area2) %>% 
  st_transform(crs = 4326)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
buffer_per_tract2 <- acs2020c %>% 
  left_join(buffer_per_tract1 %>% select(GEOID, pct_served1) %>% st_set_geometry(NULL),
            by = "GEOID") %>% 
  # There are many NAs in the pct_served1 column
  mutate(pct_served1 = case_when(is.na(pct_served1) ~ 0,
                                 TRUE ~ pct_served1))

pal <- colorNumeric(palette = "YlOrRd", domain = buffer_per_tract2$pct_served1)

leaflet() %>% 
  addProviderTiles(providers$CartoDB.DarkMatter) %>% 
  addPolygons(data = buffer_per_tract2, 
              fillColor = ~pal(pct_served1), fillOpacity = 0.9, 
              color = "white", opacity = 0.2, 
              weight = 1) %>% 
  addLegend("bottomright",
            pal = pal,
            values = buffer_per_tract2$pct_served1,
            title = "% Area within 400m from transit line") 

Be careful when calculating area. tm_shape( buffer_per_tract %>% mutate(diff = (area2 - area1)/area2)) + tm_polygons(col = 'diff', style = 'quantile')

# Is there any correlation between SES and service area?
var_name <- list(
  'hhinc'="Annual Household Income",
  'pct_bl'="% Black population",
  'pct_novhc'="% Household without vehicle",
  'pct_wh'="% White population",
  'r_tot'="Total population",
  'ln_pop_den' = "Population density"
)

var_labeller <- function(variable, value){
  return(var_name[value])
}

na_index <- buffer_per_tract2 %>% 
  select(hhinc, r_tot, ln_pop_den, starts_with('pct')) %>% 
  is.na(.) %>% 
  apply(., 1, function(x) sum(x)==0)

buffer_per_tract2 %>% 
  filter(na_index) %>% 
  pivot_longer(cols = c('hhinc', 'pct_wh', 'pct_bl', 'pct_novhc', "r_tot", "ln_pop_den"), names_to = "variable", values_to = "value") %>% 
  ggplot() +
  geom_point(aes(x = pct_served1, y = value), alpha = 0.4) +
  facet_wrap(~variable, scales = "free_y", labeller = var_labeller) +
  labs(x = "% Area with 400m beffer from transit line") +
  theme_bw()
## Warning: The labeller API has been updated. Labellers taking `variable` and
## `value` arguments are now deprecated. See labellers documentation.

test_var <- c("hhinc", "pct_bl", "pct_wh", "pct_novhc", "r_tot", "ln_pop_den")

map_chr(test_var, function(x){
  cor.object <- cor.test(buffer_per_tract2[['pct_served1']], 
                         buffer_per_tract2[[x]])
  r <- round(cor.object$estimate,3)
  t <- round(cor.object$statistic,3)
  p <- round(cor.object$p.value,3)
  return(paste0("for ",x ,", r=", r, " (t=", t, ", p=", p, ")"))
})
## [1] "for hhinc, r=-0.404 (t=-10.715, p=0)"    
## [2] "for pct_bl, r=0.105 (t=2.559, p=0.011)"  
## [3] "for pct_wh, r=-0.112 (t=-2.739, p=0.006)"
## [4] "for pct_novhc, r=0.527 (t=15.039, p=0)"  
## [5] "for r_tot, r=-0.305 (t=-7.768, p=0)"     
## [6] "for ln_pop_den, r=0.531 (t=15.221, p=0)"

Reflecting the frequency differences

# Extract MARTA rail route_id
rail_route_id <- atl$routes %>% filter(route_type == '1') %>% .[["route_id"]]

# trip_id for rails
rail_trip_id <- atl$trips %>% filter(route_id %in% rail_route_id) %>% .[['trip_id']]

# stop_id for rails
rail_stop_id <- atl$stop_times %>% filter(trip_id %in% rail_trip_id) %>% with(unique(.[["stop_id"]]))

# calculate the total number of trips passed through each stop
n_per_stop <- atlsf$stops %>% 
  left_join(
    atl$stop_times %>%
      group_by(stop_id) %>% 
      summarise(pass_count = n()) %>% 
      ungroup() %>% 
      filter(pass_count < 2000),
    by = "stop_id"
    ) %>% 
  mutate(stop_type = case_when(
    stop_id %in% rail_stop_id ~ "Rail",
    TRUE ~ "Others"
  ))

pal <- colorQuantile(palette = "YlOrRd", domain = n_per_stop$pass_count)

# Merge it with 'stop' geometry for visualization
rail_stops <- n_per_stop %>% filter(stop_type == "Rail")
other_stops <- n_per_stop %>% filter(stop_type == "Others")

n_per_stop_rail <- leaflet() %>% 
    addProviderTiles(providers$CartoDB.DarkMatter) %>% 
    addCircles(data = rail_stops,
               color = ~pal(pass_count),
               weight = 6,
               opacity = 0.5,
               popup = as.character(paste0("Count: ", rail_stops$pass_count, ", ID: ", rail_stops$stop_id)))

n_per_stop_others <- leaflet() %>% 
  addProviderTiles(providers$CartoDB.DarkMatter) %>% 
  addCircles(data = other_stops,
               color = ~pal(pass_count),
               weight = 6,
               opacity = 0.5,
               popup = as.character(paste0("Count: ", other_stops$pass_count, ", ID: ", other_stops$stop_id))) %>% 
  addLegend("bottomright", 
            pal = pal, 
            values = n_per_stop$pass_count,
            title = "Total Stops",
            opacity = 1)

leafsync::sync(n_per_stop_rail, n_per_stop_others)

Merge frequency by stop information with ACS data

stop_tract <- acs2020c %>% 
  st_transform(crs = 4326) %>% 
  st_join(n_per_stop) %>% 
  group_by(GEOID) %>% 
  summarise(stop_count = sum(!is.na(pass_count)),
            pass_count = sum(pass_count, na.rm = T),
            hhinc = mean(hhinc),
            r_tot = mean(r_tot),
            pct_wh = mean(pct_wh),
            pct_bl = mean(pct_bl),
            pct_novhc = mean(pct_novhc),
            r_tot = mean(r_tot),
            ln_pop_den = mean(ln_pop_den))

stop_tract
## Simple feature collection with 591 features and 9 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.85071 ymin: 33.35246 xmax: -84.02371 ymax: 34.18629
## Geodetic CRS:  WGS 84
## # A tibble: 591 × 10
##    GEOID    stop_count pass_count hhinc r_tot pct_wh pct_bl pct_novhc ln_pop_den
##    <chr>         <int>      <int> <dbl> <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
##  1 1306304…         30       3506 35530  2832 0.0498  0.792    0.189        5.61
##  2 1306304…          6        561 39500  3911 0.0353  0.867    0.0796       7.20
##  3 1306304…         12       1506 44158  5045 0.0200  0.920    0.133        7.51
##  4 1306304…         16       2481 38227  5896 0.217   0.588    0.117        6.94
##  5 1306304…         16       1527 34667  4465 0.127   0.712    0.0360       6.93
##  6 1306304…          0          0 51039  4393 0.327   0.340    0.0715       7.26
##  7 1306304…         48       8258 24552  3941 0.249   0.558    0.229        6.06
##  8 1306304…         20       3050 32202  3048 0.306   0.648    0.152        7.06
##  9 1306304…         15       2797 42983  3903 0.390   0.463    0.161        6.95
## 10 1306304…         30       4232 48866  3927 0.220   0.425    0.0338       6.05
## # … with 581 more rows, and 1 more variable: geometry <POLYGON [°]>
## Warning: The labeller API has been updated. Labellers taking `variable` and `value` arguments are now deprecated. See labellers documentation.
## The labeller API has been updated. Labellers taking `variable` and `value` arguments are now deprecated. See labellers documentation.

plotly::ggplotly(stop_count)
plotly::ggplotly(pass_count)

Open Street Map

library(osmdata)
## Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
# Get bounding box coordinates for Atlanta
bb <- getbb('Atlanta, GA')

# To see where bb covers
bb %>% t %>% data.frame() %>% 
  st_as_sf(coords = c("x", "y"), crs = 4326) %>% 
  st_bbox() %>% 
  st_as_sfc() %>% 
  tm_shape() + tm_borders()

OSM wiki provides a detailed description on various ‘key-value’ pairs.

# Get OSM road data
osm_road <- opq(bbox = bb) %>%
  add_osm_feature(key = 'highway', 
                  value = c('motorway','primary','secondary', 'teriary', 
                            'unclassified', 'residential', 'living_street')) %>%
  osmdata_sf()

names(osm_road)
## [1] "bbox"              "overpass_call"     "meta"             
## [4] "osm_points"        "osm_lines"         "osm_polygons"     
## [7] "osm_multilines"    "osm_multipolygons"
tm_shape(osm_road$osm_lines) + tm_lines(col = "highway")
table(osm_road$osm_lines$highway)
## 
## living_street      motorway       primary   residential     secondary 
##            17           797           741         14742          2482 
##  unclassified 
##           390